
Simulated Environment



OFCOURSE: A Multi-Agent Reinforcement Learning Environment for Order Fulfillment

Neural Information Processing Systems

The dramatic growth of global e-commerce has led to a surge in demand for efficient and cost-effective order fulfillment, which can increase customers' service levels and sellers' competitiveness. However, managing order fulfillment is challenging due to a series of interdependent online sequential decision-making problems. To clear this hurdle, rather than solving the problems separately as attempted in some recent studies, this paper proposes a method based on multi-agent reinforcement learning to jointly solve the series of interconnected problems, encompassing order handling, packing and pickup, storage, order consolidation, and last-mile delivery. In particular, we model the integrated problem as a Markov game, wherein a team of agents learns a joint policy via interacting with a simulated environment. Since no simulated environment supporting the complete order fulfillment problem exists, we devise the Order Fulfillment COoperative mUlti-agent Reinforcement learning Scalable Environment (OFCOURSE) in the OpenAI Gym style, which allows reproduction and reuse for building customized applications. By constructing the fulfillment system in OFCOURSE, we optimize a joint policy that solves the integrated problem, facilitating sequential order-wise operations across all fulfillment units and minimizing the total cost of fulfilling all orders within the promised time. With OFCOURSE, we also demonstrate that the joint policy learned by multi-agent reinforcement learning outperforms the combination of locally optimal policies.
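
As a concrete illustration of the "OpenAI Gym style" interface mentioned above, here is a minimal, hypothetical skeleton of a multi-agent fulfillment environment; the class name, observation layout, and cost model are illustrative assumptions, not OFCOURSE's actual API.

```python
# Minimal sketch of a Gym-style multi-agent fulfillment environment.
# One agent per fulfillment unit; shared reward = negative total cost.
import gym
from gym import spaces
import numpy as np

class FulfillmentEnv(gym.Env):
    """Hypothetical skeleton, not the actual OFCOURSE interface."""

    def __init__(self, n_agents=3, horizon=100):
        self.n_agents = n_agents
        self.horizon = horizon
        # Each agent observes its local queue length and time remaining.
        self.observation_space = spaces.Box(low=0, high=np.inf, shape=(2,))
        # Actions: 0 = hold order, 1 = process/forward order.
        self.action_space = spaces.Discrete(2)

    def reset(self):
        self.t = 0
        self.queues = np.zeros(self.n_agents)
        return {i: np.array([self.queues[i], float(self.horizon)])
                for i in range(self.n_agents)}

    def step(self, actions):
        self.t += 1
        arrivals = np.random.poisson(1.0, self.n_agents)   # new orders
        processed = np.array([actions[i] for i in range(self.n_agents)])
        self.queues = np.maximum(self.queues + arrivals - processed, 0)
        reward = -self.queues.sum()                        # shared team reward
        done = self.t >= self.horizon
        obs = {i: np.array([self.queues[i], float(self.horizon - self.t)])
               for i in range(self.n_agents)}
        return obs, reward, done, {}
```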


Simulated Human Learning in a Dynamic, Partially-Observed, Time-Series Environment

Jiang, Jeffrey, Hong, Kevin, Kuczynski, Emily, Pottie, Gregory

arXiv.org Artificial Intelligence

While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. We then develop reinforcement learning ITSs that learn the individual state of each student while drawing on population-level information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but they also introduce a cost-benefit decision: finding a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches and find that they provide different solutions but similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regard to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and find that our policies boost the performance of quiz and midterm structures more than that of a finals-only structure, highlighting the benefit of having additional information.
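
The cost-benefit probing decision described above can be made concrete with a small sketch: a scalar Gaussian belief over student mastery, a Kalman-style update after each probe, and a rule that probes only when uncertainty is high. All names, thresholds, and the belief model here are illustrative assumptions, not the paper's environment.

```python
# Sketch of the probe-vs-teach trade-off under a Gaussian mastery belief.
def choose_intervention(mean, var, probe_cost=0.1, var_threshold=0.05):
    """Probe only when the estimate is too uncertain to act on."""
    if var > var_threshold and var > probe_cost:
        return "probe"                      # e.g., a short quiz
    return "tutor" if mean < 0.6 else "lecture"

def update_belief(mean, var, observation, obs_noise=0.02):
    """Kalman-style Gaussian update after observing a probe score."""
    gain = var / (var + obs_noise)
    return mean + gain * (observation - mean), (1 - gain) * var

mean, var = 0.5, 0.2                        # prior over mastery in [0, 1]
for score in [0.4, 0.55, 0.6]:              # simulated probe results
    print(choose_intervention(mean, var))
    mean, var = update_belief(mean, var, score)
```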


Uncertainty-Aware Active Source Tracking of Marine Pollution using Unmanned Surface Vehicles

Ma, Song, Wang, Yanchao, Bucknall, Richard, Liu, Yuanchang

arXiv.org Artificial Intelligence

This paper proposes an uncertainty-aware marine pollution source tracking framework for unmanned surface vehicles (USVs). By integrating high-fidelity marine pollution dispersion simulation with informative path planning techniques, we demonstrate effective identification of pollution sources in marine environments. The proposed approach is implemented based on the Robot Operating System (ROS), processing real-time sensor data to update probabilistic source location estimates. Experiments conducted in simulated environments with varying source locations, wave conditions, and starting positions demonstrate the framework's ability to localise pollution sources with high accuracy. Results show that the proposed approach achieves reliable source localisation efficiently and outperforms the existing baseline. This work contributes to the development of fully autonomous environmental monitoring capabilities essential for rapid response to marine pollution incidents. Pollution discharged into the marine environment has severe consequences for ecosystems [1], [2] and human health [3].
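
A minimal sketch of the probabilistic source-location update the abstract alludes to: a grid posterior over candidate source positions, refined by each sensor reading. The Gaussian plume-style likelihood and all parameters below are illustrative assumptions, not the paper's dispersion model.

```python
# Bayesian grid update of a pollution-source posterior from one reading.
import numpy as np

def expected_concentration(source, sensor, strength=1.0, spread=5.0):
    """Toy plume model: concentration decays with squared distance."""
    d2 = np.sum((source - sensor) ** 2, axis=-1)
    return strength * np.exp(-d2 / (2 * spread ** 2))

def update_posterior(posterior, grid, sensor_pos, reading, noise=0.05):
    """Multiply prior by Gaussian measurement likelihood, renormalize."""
    pred = expected_concentration(grid, sensor_pos)
    likelihood = np.exp(-(reading - pred) ** 2 / (2 * noise ** 2))
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# 50x50 search grid with a uniform prior over the source location.
xs, ys = np.meshgrid(np.arange(50.0), np.arange(50.0))
grid = np.stack([xs, ys], axis=-1)
posterior = np.full((50, 50), 1.0 / 2500)
posterior = update_posterior(posterior, grid, np.array([10.0, 10.0]), reading=0.8)
print(np.unravel_index(posterior.argmax(), posterior.shape))  # MAP cell
```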


Simulating Environments with Reasoning Models for Agent Training

Li, Yuetai, Inan, Huseyin A, Yue, Xiang, Chen, Wei-Ning, Wutschitz, Lukas, Kulkarni, Janardhan, Poovendran, Radha, Sim, Robert, Rajmohan, Saravan

arXiv.org Artificial Intelligence

LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy and brittle, and it limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.
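
The core idea of LLM-simulated feedback can be sketched in a few lines: the RL loop's step function becomes a prompted model call that returns an observation, reward, and termination flag. `query_llm` below is a placeholder for any chat-completion client (here it returns a canned reply so the sketch runs); it is not an API from the paper.

```python
# Sketch: the environment "step" is an LLM call, not a real testbed.
import json

def query_llm(prompt: str) -> str:
    """Stand-in for any chat-completion client (canned reply here)."""
    return '{"observation": "tool returned OK", "reward": 0.1, "done": false}'

def simulated_step(history, action):
    prompt = (
        "You simulate a tool-use environment. Given the interaction so far "
        "and the agent's next action, reply with JSON: "
        '{"observation": str, "reward": float, "done": bool}\n'
        f"History: {json.dumps(history)}\nAction: {json.dumps(action)}"
    )
    feedback = json.loads(query_llm(prompt))
    history.append({"action": action, "observation": feedback["observation"]})
    return feedback["observation"], feedback["reward"], feedback["done"]

history = []
print(simulated_step(history, {"tool": "search_orders", "args": {"id": 42}}))
```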


Mano Technical Report

Fu, Tianyu, Su, Anyang, Zhao, Chenxu, Wang, Hanning, Wu, Minghui, Yu, Zhe, Hu, Fei, Shi, Mingjia, Dong, Wei, Wang, Jiayao, Chen, Yuyang, Yu, Ruiyang, Peng, Siran, Li, Menglin, Huang, Nan, Wei, Haitian, Yu, Jiawei, Xin, Yi, Zhao, Xilin, Gu, Kai, Jiang, Ping, Zhou, Sifan, Wang, Shuo

arXiv.org Artificial Intelligence

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
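
The verification module for error recovery described above suggests an act-verify-recover loop; the sketch below assumes hypothetical `agent`, `verifier`, and `execute` callables and is not Mano's actual interface.

```python
# Sketch of an act-verify-recover loop for a GUI agent step.
def run_step(agent, verifier, execute, screenshot, goal, max_retries=2):
    for attempt in range(max_retries + 1):
        action = agent(screenshot, goal)        # propose a GUI action
        new_screenshot = execute(action)        # click / type / scroll
        if verifier(goal, action, new_screenshot):
            return new_screenshot               # verified: advance
        screenshot = new_screenshot             # recover: re-plan from here
    raise RuntimeError("verification failed after retries")
```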


Detection of Anomalous Behavior in Robot Systems Based on Machine Learning

Nissan, Mahfuzul I., Aktar, Sharmin

arXiv.org Artificial Intelligence

Ensuring the safe and reliable operation of robotic systems is paramount to prevent potential disasters and safeguard human well-being. Despite rigorous design and engineering practices, these systems can still experience malfunctions, leading to safety risks. In this study, we present a machine learning-based approach for detecting anomalies in system logs to enhance the safety and reliability of robotic systems. We collected logs from two distinct scenarios using CoppeliaSim and comparatively evaluated several machine learning models, including Logistic Regression (LR), Support Vector Machine (SVM), and an Autoencoder. Our system was evaluated in a quadcopter context (Context 1) and a Pioneer robot context (Context 2). Results showed that while LR demonstrated superior performance in Context 1, the Autoencoder model proved to be the most effective in Context 2. This highlights that the optimal model choice is context-dependent, likely due to the varying complexity of anomalies across different robotic platforms. This research underscores the value of a comparative approach and demonstrates the particular strengths of autoencoders for detecting complex anomalies in robotic systems.
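A minimal sketch of the comparative setup described above, using scikit-learn: a supervised logistic-regression baseline next to an autoencoder (approximated here by an MLP trained to reconstruct its input through a bottleneck) that flags anomalies by reconstruction error. The synthetic features stand in for numeric features extracted from robot logs; they are not the paper's data.

```python
# Compare a supervised classifier with an autoencoder anomaly detector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, (500, 8))            # stand-in log features
X_test = np.vstack([rng.normal(0, 1, (50, 8)),
                    rng.normal(4, 1, (10, 8))])  # last 10 rows anomalous

# Supervised baseline: needs labeled anomalies at training time.
X_train = np.vstack([X_normal, rng.normal(4, 1, (10, 8))])
y_train = np.r_[np.zeros(500), np.ones(10)]
lr = LogisticRegression().fit(X_train, y_train)
print("LR accuracy:", lr.score(X_test, np.r_[np.zeros(50), np.ones(10)]))

# Autoencoder: train on normal data only, threshold reconstruction error.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000).fit(X_normal, X_normal)
train_err = ((ae.predict(X_normal) - X_normal) ** 2).mean(axis=1)
test_err = ((ae.predict(X_test) - X_test) ** 2).mean(axis=1)
threshold = np.percentile(train_err, 99)
print("flagged rows:", np.where(test_err > threshold)[0])
```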


Building surrogate models using trajectories of agents trained by Reinforcement Learning

Cestero, Julen, Quartulli, Marco, Restelli, Marcello

arXiv.org Artificial Intelligence

Sample efficiency in the face of computationally expensive simulations is a common concern in surrogate modeling. Current strategies to minimize the number of samples needed are not as effective in simulated environments with wide state spaces. In response to this challenge, we propose a novel method to efficiently sample simulated deterministic environments by using policies trained by Reinforcement Learning. We provide an extensive analysis of these surrogate-building strategies with respect to Latin-Hypercube sampling or Active Learning and Kriging, cross-validating performance across all sampled datasets. The analysis shows that a mixed dataset, which includes samples acquired by random agents, expert agents, and agents trained to explore the regions of maximum entropy of the state transition distribution, provides the best scores across all datasets, which is crucial for a meaningful state space representation. We conclude that the proposed method improves on the state of the art and clears the path to applying surrogate-aided Reinforcement Learning policy optimization strategies on complex simulators.
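
The mixed-dataset idea can be sketched with a toy one-dimensional environment: pool transitions gathered by random, expert, and entropy-seeking policies, then fit a Kriging-style surrogate. The environment, policies, and dimensions below are illustrative stand-ins, not the paper's setup.

```python
# Build a surrogate of a toy simulator from a mix of policy rollouts.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def env_step(s, a):
    """Toy deterministic simulator: next state from (state, action)."""
    return s + 0.1 * a + 0.01 * np.sin(s)

rng = np.random.default_rng(0)
random_pi = lambda s: rng.uniform(-1, 1)               # coverage by chance
expert_pi = lambda s: -np.tanh(s)                      # drives state to 0
entropy_pi = lambda s: float(rng.choice([-1.0, 1.0]))  # spreads transitions

dataset = []
for pi in (random_pi, expert_pi, entropy_pi):          # mixed dataset
    s = 1.0
    for _ in range(50):
        a = pi(s)
        s_next = env_step(s, a)
        dataset.append(((s, a), s_next))
        s = s_next

X = np.array([sa for sa, _ in dataset])                # (150, 2) inputs
y = np.array([sn for _, sn in dataset])                # next-state targets
surrogate = GaussianProcessRegressor(alpha=1e-6).fit(X, y)  # jitter for stability
print(surrogate.predict([[0.5, 0.2]]))
```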



MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language

Rincon, Andres Garcia, Ferrante, Eliseo

arXiv.org Artificial Intelligence

This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google's Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research.
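
Grammar-based validation of LLM-generated behavior trees, as used to measure syntactic validity above, can be sketched with a toy grammar: composite nodes (`seq`, `sel`) over a fixed set of leaf primitives. The grammar and primitive set below are illustrative assumptions, not MinionsLLM's actual definitions.

```python
# Check LLM-generated behavior trees against a toy formal grammar.
PRIMITIVES = {"move_to_target", "pick_up", "drop", "wait"}
COMPOSITES = {"seq", "sel"}

def valid_bt(tree) -> bool:
    """A tree is a primitive name or (composite, child, child, ...)."""
    if isinstance(tree, str):
        return tree in PRIMITIVES
    if isinstance(tree, tuple) and len(tree) >= 2:
        head, *children = tree
        return head in COMPOSITES and all(valid_bt(c) for c in children)
    return False

# Syntactic-validity rate over a batch of (hypothetical) LLM outputs.
outputs = [("seq", "move_to_target", "pick_up"), ("sel", "fly"), "wait"]
rate = sum(valid_bt(t) for t in outputs) / len(outputs)
print(f"syntactic validity: {rate:.1%}")   # 66.7%: "fly" is not a primitive
```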